In this project, you will apply the skills you acquired in the course to gather and wrangle real-world data using two datasets of your choice.
You will retrieve and extract the data, assess it programmatically and visually across elements of data quality and structure, and implement a cleaning strategy. You will then store the updated data in your selected database/data store, combine the data, and answer a research question with the datasets.
Throughout the process, you are expected to:
In this section, you will extract data using two different data gathering methods and combine the data. Use at least two different types of data-gathering methods.
In this project, the objective is to examine crash and collision data for two large cities in the United States: Chicago and New York. Specifically, the objective is to determine:
Which city has a higher fatality count when crashes occur, Chicago or New York City?
Between Chicago and New York, which date or time of year has recorded the highest fatalities overall across both cities?
The two datasets that will be used for this project can be found on the Data.gov catalog website.
The Chicago dataset can be found here: Chicago-motor-vehicle-crashes
The New York dataset can be found here: NewYork-motor-vehicle-collisions
List of data gathering methods:
For this project, the TWO data-gathering methods chosen from the list above are:
Download data manually
Gather data by accessing APIs
Here, the libraries will be imported.
# import libraries
import os
import numpy as np
import pandas as pd
import requests
import bs4
import sklearn
import PIL
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
import missingno as msn
%matplotlib inline
The first dataset that will be used is Traffic Crashes in Chicago.
The file type is CSV. The file name is traffic-crashes-chicago.csv.
The method used here is to open the CSV file manually. The file has already been downloaded locally into the datasets folder, and it will be opened using the Pandas read_csv method.
# Using Pandas
sub = os.path.join('datasets', 'traffic-crashes-chicago.csv')
chicago_rawdata = pd.read_csv(sub)
# Preview the first five rows
chicago_rawdata.head()
| | CRASH_RECORD_ID | RD_NO | CRASH_DATE_EST_I | CRASH_DATE | POSTED_SPEED_LIMIT | TRAFFIC_CONTROL_DEVICE | DEVICE_CONDITION | WEATHER_CONDITION | LIGHTING_CONDITION | FIRST_CRASH_TYPE | ... | INJURIES_NON_INCAPACITATING | INJURIES_REPORTED_NOT_EVIDENT | INJURIES_NO_INDICATION | INJURIES_UNKNOWN | CRASH_HOUR | CRASH_DAY_OF_WEEK | CRASH_MONTH | LATITUDE | LONGITUDE | LOCATION |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 79c7a2ce89f446262efd86df3d72d18b04ba487024b7c4... | JC199149 | NaN | 03/25/2019 02:43:00 PM | 30 | TRAFFIC SIGNAL | FUNCTIONING PROPERLY | CLEAR | DAYLIGHT | TURNING | ... | 0.0 | 1.0 | 2.0 | 0.0 | 14 | 2 | 3 | 41.884547 | -87.641201 | POINT (-87.64120093714 41.884547224337) |
| 1 | 792b539deaaad65ee5b4a9691d927a34d298eb33d42af0... | JB422857 | NaN | 09/05/2018 08:40:00 AM | 30 | NO CONTROLS | NO CONTROLS | CLEAR | DAYLIGHT | ANGLE | ... | 0.0 | 0.0 | 2.0 | 0.0 | 8 | 4 | 9 | 41.968562 | -87.740659 | POINT (-87.740659314632 41.968562453871) |
| 2 | 0115ade9a755e835255508463f7e9c4a9a0b47e9304238... | JF318029 | NaN | 07/15/2022 12:45:00 AM | 30 | UNKNOWN | UNKNOWN | CLEAR | DARKNESS, LIGHTED ROAD | ANGLE | ... | 0.0 | 0.0 | 2.0 | 0.0 | 0 | 6 | 7 | 41.886336 | -87.716203 | POINT (-87.716203130599 41.886336409761) |
| 3 | 017040c61958d2fa977c956b2bd2d6759ef7754496dc96... | JF324552 | NaN | 07/15/2022 06:50:00 PM | 30 | TRAFFIC SIGNAL | FUNCTIONING PROPERLY | CLEAR | DAYLIGHT | REAR END | ... | 0.0 | 0.0 | 2.0 | 0.0 | 18 | 6 | 7 | 41.925111 | -87.667997 | POINT (-87.667997321599 41.925110815832) |
| 4 | 78eee027ec3dcc85d36c9e3fdae4729dcc56440105d65b... | JB291672 | NaN | 06/03/2018 05:00:00 PM | 30 | NO CONTROLS | NO CONTROLS | CLEAR | UNKNOWN | PARKED MOTOR VEHICLE | ... | 0.0 | 0.0 | 1.0 | 0.0 | 17 | 1 | 6 | 41.910758 | -87.731389 | POINT (-87.731388754145 41.910757551599) |
5 rows × 49 columns
From the above, it can be seen that the traffic-crashes-chicago.csv dataset has been successfully loaded using Pandas and assigned to the variable chicago_rawdata.
We picked this Chicago dataset because it is vast and the data appears well maintained and up to date. As one of the biggest cities in the United States, Chicago's accident count must be quite high, so its accident data is vital.
The gathering method used above was to download the dataset and load the CSV file locally from the PC.
Some interesting variables picked up at a glance were CRASH_DATE, LOCATION and CRASH_MONTH.
The next dataset used will be Motor Vehicle Collisions-Crashes in New York.
The data is provided as a JSON file served from a webpage.
Loading the data here differs slightly from the method used for Dataset 1. Here we will request the data from the API documented on the webpage Motor-Vehicle-Collisions-NYC, which provides the records in JSON format. Then, the data will be converted to a Pandas DataFrame.
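Before calling the live endpoint, the conversion step can be sketched with a tiny local sample shaped like the API's JSON response (the two records below are hypothetical, not real API output):

```python
import json

import pandas as pd

# Hypothetical two-record sample mimicking the shape of the API's JSON response
sample = '''[
  {"crash_date": "2021-09-11T00:00:00.000", "crash_time": "2:39",
   "number_of_persons_killed": "0"},
  {"crash_date": "2022-03-26T00:00:00.000", "crash_time": "11:45",
   "number_of_persons_killed": "1"}
]'''

# Parse the JSON text into a list of dicts, then build a DataFrame from it
records = json.loads(sample)
sample_df = pd.DataFrame(records)
print(sample_df.shape)  # (2, 3)
```

This is exactly what pd.read_json does in one step when pointed at the URL: fetch the JSON array and turn each record into a row.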
# The URL where the data is located
URL_link = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'
# Using Pandas to read the JSON response into a DataFrame
newyork_rawdata_json = pd.read_json(URL_link)
# Preview the first five rows
newyork_rawdata_json.head()
| | crash_date | crash_time | on_street_name | off_street_name | number_of_persons_injured | number_of_persons_killed | number_of_pedestrians_injured | number_of_pedestrians_killed | number_of_cyclist_injured | number_of_cyclist_killed | ... | latitude | longitude | location | cross_street_name | contributing_factor_vehicle_3 | vehicle_type_code_3 | contributing_factor_vehicle_4 | vehicle_type_code_4 | contributing_factor_vehicle_5 | vehicle_type_code_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-09-11T00:00:00.000 | 2023-05-21 02:39:00 | WHITESTONE EXPRESSWAY | 20 AVENUE | 2 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2022-03-26T00:00:00.000 | 2023-05-21 11:45:00 | QUEENSBORO BRIDGE UPPER | NaN | 1 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2022-06-29T00:00:00.000 | 2023-05-21 06:55:00 | THROGS NECK BRIDGE | NaN | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2021-09-11T00:00:00.000 | 2023-05-21 09:35:00 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40.667202 | -73.866500 | {'latitude': '40.667202', 'longitude': '-73.86... | 1211 LORING AVENUE | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2021-12-14T00:00:00.000 | 2023-05-21 08:13:00 | SARATOGA AVENUE | DECATUR STREET | 0 | 0 | 0 | 0 | 0 | 0 | ... | 40.683304 | -73.917274 | {'latitude': '40.683304', 'longitude': '-73.91... | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
When doing a project about accidents, the first city I automatically searched for was New York. New York is a dense city with heavy traffic, a large population, and constant movement, so I had to include it in my findings.
The gathering method here, as mentioned, was to download the data, an online JSON file, from the website via its API. The data downloaded successfully onto the machine. (Below is an optional step to save this data as a CSV locally.)
Some interesting variables noticed at first glance were crash_date, crash_time and number_of_persons_injured.
Finally, the New York dataset was not as big as the Chicago one, but it was still very rich with data and very much kept up to date.
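A likely reason the New York frame is smaller: the endpoint is a Socrata (SODA) API, and such endpoints typically return only 1,000 rows per request by default. As a hedged sketch (the `$limit` parameter is a SODA convention; the exact ceiling can vary by portal), a larger page could be requested like this:

```python
# Sketch: request a larger page from the SODA endpoint via the $limit parameter.
# The network call is left commented out so the snippet stays self-contained.
url = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'
params = {'$limit': 50000}  # ask for up to 50,000 records instead of the default 1,000

# response = requests.get(url, params=params, timeout=60)
# newyork_rawdata_json = pd.DataFrame(response.json())

# Building the query string by hand shows the URL pd.read_json would need:
full_url = f"{url}?$limit={params['$limit']}"
print(full_url)
```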
Optional data storing step: since Dataset 1 was loaded manually, that dataset is already saved locally. Dataset 2 was loaded from the URL, so we will save that raw dataset locally now.
# Optional: store the dataset locally from the JSON data downloaded online
# (os.path.join keeps the path portable across operating systems)
newyork_rawdata_json.to_csv(os.path.join('datasets', 'dataset_two.csv'), index=False)
# Reload the downloaded CSV file of raw data
# This is now the raw data file
newyork_rawdata = pd.read_csv(os.path.join('datasets', 'dataset_two.csv'))
print('Successfully created local backup for New York data!')
Successfully created local backup for New York data!
Within the local datasets folder we now have our two raw datasets.
Assess the data according to data quality and tidiness metrics using the report below.
List two data quality issues and two tidiness issues. Assess each data issue visually and programmatically, then briefly describe the issue you find. Make sure you include justifications for the methods you use for the assessment.
We are going to first inspect the two datasets' variables by plotting visualizations. Here we will address the first quality issue: seeing how much missing data is present in each of the two datasets.
# Inspecting the Chicago dataframe visually for missing data
msn.bar(chicago_rawdata)
<Axes: >
# Inspecting the New York dataframe visually for missing data
msn.bar(newyork_rawdata)
<Axes: >
We can see the missing data from each of the datasets in two different ways; the bar charts above show which columns have missing rows of data.
# Using isnull().sum() to count the missing values in each column of each dataframe
chicago_rawdata.isnull().sum()
CRASH_RECORD_ID                       0
RD_NO                              4016
CRASH_DATE_EST_I                 654628
CRASH_DATE                            0
POSTED_SPEED_LIMIT                    0
TRAFFIC_CONTROL_DEVICE                0
DEVICE_CONDITION                      0
WEATHER_CONDITION                     0
LIGHTING_CONDITION                    0
FIRST_CRASH_TYPE                      0
TRAFFICWAY_TYPE                       0
LANE_CNT                         509267
ALIGNMENT                             0
ROADWAY_SURFACE_COND                  0
ROAD_DEFECT                           0
REPORT_TYPE                       19672
CRASH_TYPE                            0
INTERSECTION_RELATED_I           545779
NOT_RIGHT_OF_WAY_I               675089
HIT_AND_RUN_I                    488097
DAMAGE                                0
DATE_POLICE_NOTIFIED                  0
PRIM_CONTRIBUTORY_CAUSE               0
SEC_CONTRIBUTORY_CAUSE                0
STREET_NO                             0
STREET_DIRECTION                      4
STREET_NAME                           1
BEAT_OF_OCCURRENCE                    5
PHOTOS_TAKEN_I                   699447
STATEMENTS_TAKEN_I               693339
DOORING_I                        706117
WORK_ZONE_I                      704150
WORK_ZONE_TYPE                   705048
WORKERS_PRESENT_I                707204
NUM_UNITS                             0
MOST_SEVERE_INJURY                 1541
INJURIES_TOTAL                     1530
INJURIES_FATAL                     1530
INJURIES_INCAPACITATING            1530
INJURIES_NON_INCAPACITATING        1530
INJURIES_REPORTED_NOT_EVIDENT      1530
INJURIES_NO_INDICATION             1530
INJURIES_UNKNOWN                   1530
CRASH_HOUR                            0
CRASH_DAY_OF_WEEK                     0
CRASH_MONTH                           0
LATITUDE                           4553
LONGITUDE                          4553
LOCATION                           4553
dtype: int64
chicago_rawdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708268 entries, 0 to 708267
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype
---  ------                         --------------   -----
 0   CRASH_RECORD_ID                708268 non-null  object
 1   RD_NO                          704252 non-null  object
 2   CRASH_DATE_EST_I               53640 non-null   object
 3   CRASH_DATE                     708268 non-null  object
 4   POSTED_SPEED_LIMIT             708268 non-null  int64
 5   TRAFFIC_CONTROL_DEVICE         708268 non-null  object
 6   DEVICE_CONDITION               708268 non-null  object
 7   WEATHER_CONDITION              708268 non-null  object
 8   LIGHTING_CONDITION             708268 non-null  object
 9   FIRST_CRASH_TYPE               708268 non-null  object
 10  TRAFFICWAY_TYPE                708268 non-null  object
 11  LANE_CNT                       199001 non-null  float64
 12  ALIGNMENT                      708268 non-null  object
 13  ROADWAY_SURFACE_COND           708268 non-null  object
 14  ROAD_DEFECT                    708268 non-null  object
 15  REPORT_TYPE                    688596 non-null  object
 16  CRASH_TYPE                     708268 non-null  object
 17  INTERSECTION_RELATED_I         162489 non-null  object
 18  NOT_RIGHT_OF_WAY_I             33179 non-null   object
 19  HIT_AND_RUN_I                  220171 non-null  object
 20  DAMAGE                         708268 non-null  object
 21  DATE_POLICE_NOTIFIED           708268 non-null  object
 22  PRIM_CONTRIBUTORY_CAUSE        708268 non-null  object
 23  SEC_CONTRIBUTORY_CAUSE         708268 non-null  object
 24  STREET_NO                      708268 non-null  int64
 25  STREET_DIRECTION               708264 non-null  object
 26  STREET_NAME                    708267 non-null  object
 27  BEAT_OF_OCCURRENCE             708263 non-null  float64
 28  PHOTOS_TAKEN_I                 8821 non-null    object
 29  STATEMENTS_TAKEN_I             14929 non-null   object
 30  DOORING_I                      2151 non-null    object
 31  WORK_ZONE_I                    4118 non-null    object
 32  WORK_ZONE_TYPE                 3220 non-null    object
 33  WORKERS_PRESENT_I              1064 non-null    object
 34  NUM_UNITS                      708268 non-null  int64
 35  MOST_SEVERE_INJURY             706727 non-null  object
 36  INJURIES_TOTAL                 706738 non-null  float64
 37  INJURIES_FATAL                 706738 non-null  float64
 38  INJURIES_INCAPACITATING        706738 non-null  float64
 39  INJURIES_NON_INCAPACITATING    706738 non-null  float64
 40  INJURIES_REPORTED_NOT_EVIDENT  706738 non-null  float64
 41  INJURIES_NO_INDICATION         706738 non-null  float64
 42  INJURIES_UNKNOWN               706738 non-null  float64
 43  CRASH_HOUR                     708268 non-null  int64
 44  CRASH_DAY_OF_WEEK              708268 non-null  int64
 45  CRASH_MONTH                    708268 non-null  int64
 46  LATITUDE                       703715 non-null  float64
 47  LONGITUDE                      703715 non-null  float64
 48  LOCATION                       703715 non-null  object
dtypes: float64(11), int64(6), object(32)
memory usage: 264.8+ MB
Here we will create the threshold. We cannot depend on magic numbers or bias, so we use a proportional cutoff of 25%. Below we compute 25% of the total row count of the Chicago dataset; the output is our threshold. This means any column with NaN values exceeding that threshold will be dropped!
# For the Chicago data threshold: 25% of 708268 rows
25/100 * 708268
print('Any column yielding a greater or equal number of missing values than `177067` will be dropped from the Chicago dataset.')
Any column yielding a greater or equal number of missing values than `177067` will be dropped from the Chicago dataset.
Now for the New York data...
newyork_rawdata.isnull().sum()
crash_date                         0
crash_time                         0
on_street_name                   258
off_street_name                  549
number_of_persons_injured          0
number_of_persons_killed           0
number_of_pedestrians_injured      0
number_of_pedestrians_killed       0
number_of_cyclist_injured          0
number_of_cyclist_killed           0
number_of_motorist_injured         0
number_of_motorist_killed          0
contributing_factor_vehicle_1      3
contributing_factor_vehicle_2    230
collision_id                       0
vehicle_type_code1                14
vehicle_type_code2               345
borough                          367
zip_code                         368
latitude                          76
longitude                         76
location                          76
cross_street_name                742
contributing_factor_vehicle_3    909
vehicle_type_code_3              919
contributing_factor_vehicle_4    973
vehicle_type_code_4              976
contributing_factor_vehicle_5    994
vehicle_type_code_5              995
dtype: int64
newyork_rawdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   crash_date                     1000 non-null   object
 1   crash_time                     1000 non-null   object
 2   on_street_name                 742 non-null    object
 3   off_street_name                451 non-null    object
 4   number_of_persons_injured      1000 non-null   int64
 5   number_of_persons_killed       1000 non-null   int64
 6   number_of_pedestrians_injured  1000 non-null   int64
 7   number_of_pedestrians_killed   1000 non-null   int64
 8   number_of_cyclist_injured      1000 non-null   int64
 9   number_of_cyclist_killed       1000 non-null   int64
 10  number_of_motorist_injured     1000 non-null   int64
 11  number_of_motorist_killed      1000 non-null   int64
 12  contributing_factor_vehicle_1  997 non-null    object
 13  contributing_factor_vehicle_2  770 non-null    object
 14  collision_id                   1000 non-null   int64
 15  vehicle_type_code1             986 non-null    object
 16  vehicle_type_code2             655 non-null    object
 17  borough                        633 non-null    object
 18  zip_code                       632 non-null    float64
 19  latitude                       924 non-null    float64
 20  longitude                      924 non-null    float64
 21  location                       924 non-null    object
 22  cross_street_name              258 non-null    object
 23  contributing_factor_vehicle_3  91 non-null     object
 24  vehicle_type_code_3            81 non-null     object
 25  contributing_factor_vehicle_4  27 non-null     object
 26  vehicle_type_code_4            24 non-null     object
 27  contributing_factor_vehicle_5  6 non-null      object
 28  vehicle_type_code_5            5 non-null      object
dtypes: float64(3), int64(9), object(17)
memory usage: 226.7+ KB
As was done for the Chicago dataset above, the same is done here with the New York data: finding the 25% threshold used to determine which columns have too many NaN values and will be dropped.
# For the New York data threshold: 25% of 1000 rows
25/100 * 1000
print('Any column yielding a greater or equal number of missing values than `250` will be dropped from the New York dataset.')
Any column yielding a greater or equal number of missing values than `250` will be dropped from the New York dataset.
Above, we plotted visualizations of the missing data for both datasets, and then programmatically counted the missing values in each column.
The biggest question here, to avoid bias, is: "How should rows and columns be dropped?" in the most logical way possible.
Not all columns will be used, because, as seen, some columns have a lot of missing data and are deemed unnecessary.
Using info() showed us the RangeIndex, which gives the number of rows in each column. For the Chicago data the figure was 708268, meaning there are 708268 rows in each column. The same applies to New York: there are 1000 rows of data in each column.
Knowing this, we use a threshold: if more than 25% of a column's data is missing (NaN), we drop the column completely. So any column whose missing-value count exceeds that number will be dropped. See the threshold code above for the logic used to avoid magic numbers.
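The drop rule described above can be sketched on a small hypothetical frame (the column names below are made up; only the 25% rule is the point):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: one healthy column, one mostly-NaN column
df = pd.DataFrame({
    'keep_me':    [1, 2, 3, 4],
    'mostly_nan': [np.nan, np.nan, np.nan, 4],
})

# Drop any column whose NaN count meets or exceeds 25% of the row count
threshold = 0.25 * len(df)
to_drop = [col for col in df.columns if df[col].isnull().sum() >= threshold]
df = df.drop(columns=to_drop)
print(list(df.columns))  # ['keep_me'] - only the healthy column survives
```

Deriving the threshold from len(df) rather than a hard-coded row count keeps the rule reusable across both datasets.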
Let us think ahead for a quick moment. If we were to merge the two datasets right now, how would we distinguish the data? For example, how would we know which crash happened in New York and which in Chicago? Yes, we have the longitude and latitude, but there is no field identifying the city. If a reader were to read the merged dataset, they would have no idea which city each record came from. This is a big quality issue and needs to be attended to.
# Inspecting the dataframes visually via heatmaps
# First the Chicago data
msn.heatmap(chicago_rawdata, cmap='YlGnBu')
<Axes: >
# Visualizing the New York City columns via heatmap
msn.heatmap(newyork_rawdata, cmap='YlGnBu')
<Axes: >
We have been looking at the two datasets long enough now to notice that there is no mention of a city name anywhere. Even in the visualizations above we do not see any column pertaining to a city name, so how can the datasets be distinguished by city? We will have to create a column to help with this.
Below is a programmatic test example, using a test dataframe, of how the column will be created; it sets the expectation for what will be done with the Chicago and New York datasets respectively.
# Creating a test dataframe to demonstrate programmatically
test_data_1 = {'name': ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'],
'age': [20, 19, 22, 21],
'favorite_color': ['blue', 'red', 'yellow', "green"],
'grade': [88, 92, 95, 70]}
test_df_1 = pd.DataFrame(test_data_1, index = ['Willard Morris', 'Al Jennings', 'Omar Mullins', 'Spencer McDaniel'])
# Creating/Adding the new column as a test
test_df_1['City Name'] = 'City Name Here'
test_df_1
| | name | age | favorite_color | grade | City Name |
|---|---|---|---|---|---|
| Willard Morris | Willard Morris | 20 | blue | 88 | City Name Here |
| Al Jennings | Al Jennings | 19 | red | 92 | City Name Here |
| Omar Mullins | Omar Mullins | 22 | yellow | 95 | City Name Here |
| Spencer McDaniel | Spencer McDaniel | 21 | green | 70 | City Name Here |
As seen, there is a constant value in the new test column. It is required for this value to be constant for the analysis.
Remember, this is just an example!
From the heatmaps above we can see the column names clearly. If the data were observed by any reader or third party, they would not know whether this crash data was from New York City or Chicago.
Going further and checking the column lists of the datasets, there is no indication of the city name.
The solution is to add a new column with a constant value that explicitly states the city name and distinguishes the datasets: for example, a column named City in each dataset, holding the values New York and Chicago respectively.
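Applied to the real frames, the fix is one constant-valued assignment per dataset. The sketch below uses tiny stand-in frames (the column subsets are hypothetical) to show that, after combining, every row still identifies its source city:

```python
import pandas as pd

# Stand-in frames for the two datasets
chicago = pd.DataFrame({'INJURIES_FATAL': [0.0, 1.0]})
newyork = pd.DataFrame({'number_of_persons_killed': [0, 0]})

# One constant-valued column per frame identifies the source city
chicago['City'] = 'Chicago'
newyork['City'] = 'New York'

# After combining, each row still says which city it came from
combined = pd.concat([chicago, newyork], ignore_index=True)
print(combined['City'].tolist())  # ['Chicago', 'Chicago', 'New York', 'New York']
```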
The CRASH_DATE column in the Chicago dataset holds both the date and the time. Meanwhile, the New York dataset has two date/time columns, but one is unclear.
NOTE: The objective here is to keep the date separate from the time. Having both date and time in the same column is confusing.
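One way the split could be done, sketched against the CRASH_DATE format seen in the preview earlier (the new column names here are made up for illustration):

```python
import pandas as pd

# Two sample values copied from the preview of the Chicago data
df = pd.DataFrame({'CRASH_DATE': ['03/25/2019 02:43:00 PM',
                                  '09/05/2018 08:40:00 AM']})

# Parse with an explicit 12-hour format, then split into date and time columns
parsed = pd.to_datetime(df['CRASH_DATE'], format='%m/%d/%Y %I:%M:%S %p')
df['CRASH_DATE_ONLY'] = parsed.dt.date
df['CRASH_TIME_ONLY'] = parsed.dt.time
print(df[['CRASH_DATE_ONLY', 'CRASH_TIME_ONLY']])
```

Passing an explicit format avoids ambiguous month/day guessing and converts the 12-hour clock (02:43:00 PM) into 24-hour time (14:43:00).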
We will first display the CRASH_DATE column in a boxplot to get a visual understanding of how the data actually looks.
# Inspecting the data visually
# Visualizing how the CRASH_DATE values are distributed along the axis
fig1 = px.box(chicago_rawdata, y='CRASH_DATE')
fig1.show()